Lecture 1
High-level technologies applied to text
Dan Cristea

How far can the machine go in understanding language?

The ability to use natural language
• The ultimate test: "understanding" texts => the ability to react correctly to the message encoded in the text
  – levels of analysis:
    • lexical
    • morphological
    • syntactic
    • semantic
    • discourse
    • pragmatic

How to extract the content of texts?
• Content (semantic) = the objective knowledge, that which can be identified in the same way by a large community of humans
• Understanding language puts to work a diversity of linguistic backgrounds (innate, acquired):
  – phonological, morphological, lexical
  – syntactic
  – semantic
  – discourse
  – pragmatic
• See the Piaget ↔ Chomsky debate: innate ↔ acquired
• All these layers must be reproduced on the machine

Modules like this one can be organised in chains
• Language independent module
• Language specific resources
• APPROACHES: symbolic, statistical, neural

Creation of linguistic resources
• Language specific resources: corpora, treebanks, wordnets, verbnets, language models (neural, statistical), …

How are the resources obtained?
Step 1: extracting human expertise
• text → annotated text

How are the resources obtained?
Step 2: synthesising the models
• text → learning program → set of rules

A module
• Language-independent processing
• Language-dependent resources

Example: a syntactic parser
• Parser: language-independent software
• Set of syntactic rules for language L

How are the resources obtained?
Step 3: evaluation
• text → syntactic parser (language-independent) + set of rules for Romanian

Morphological tagging
PhD student Radu Simionescu

Noun phrase detection
PhD student Radu Simionescu

Syntactic parsing
PhD student Radu Simionescu

Automatic identification of semantic roles
• Who, what, where, when, why, how performs an action
• Example sentence: "Grupul NLP vă invită cu drag vineri, 8 martie, la o prezentare a proiectelor Metanet4U și ATLAS, pentru a vă mulțumi pentru suportul acordat" ("The NLP Group warmly invites you on Friday, 8 March, to a presentation of the Metanet4U and ATLAS projects, to thank you for the support given")
• Results (available on METASHARE):
  – Resource annotated with semantic roles
  – Program for the automatic annotation of roles
Dr Diana Trandabăț

CoRoLa – achievement of a large corpus of texts
Corpus of Contemporary Romanian Language

Curation chain: Portal – Volunteers – Portal
• Metadata:
  1 Title
  2 Author
  3 Publication date
  4 Source
  5 Translator
  6 Media
  7 Style
  8 Domain
  9 ISSN/ISBN
• Cleaning:
  1 Character
  2 Headers
  3 Footers
  4 Formulas
  5 Tables
  6 Table of contents
  7 Bibliography
  etc.

Processing chain
• Annotations:
  • token
  • part of speech
  • morphology
  • noun phrase
  • syntax
  • semantics
  • …
• TOK → POS → NP pipeline

Use of the corpus
• Access point: RACAI, Bucharest
• Mirror: IIT, Iași
• Concordances (KWIC – Key Word In Context) …

Search for complex syntactic constructs in large corpora
[drukola/base=permite] [drukola/s=syntrel:c i ] [] [drukola/m=pos:verb]

A language processing pipeline
text → INITIAL PROCESSING → SUB-SYNTACTIC PROCESSING → SYNTACTIC PROCESSING → SEMANTIC PROCESSING → DISCOURSE PROCESSING → PRAGMATIC PROCESSING → result

The document layer: processing old texts
text/image → INITIAL PROCESSING → SUB-SYNTACTIC PROCESSING → SYNTACTIC PROCESSING → SEMANTIC PROCESSING → DISCOURSE PROCESSING → PRAGMATIC PROCESSING → result
• INITIAL PROCESSING: image → IMAGE SEGMENTATION → OCR → INTERPRETATIVE TRANSCRIPTION

CyRo – build a technology that interprets old Cyrillic Romanian
• Train OCR
  classifiers to decode printed, semi-uncial and cursive Cyrillic Romanian documents
• Ambitious goals of a mixed consortium
  – library curators
  – paleolinguists
  – image processing experts
  – computational linguists

The sub-syntactic layer
text → INITIAL PROCESSING → SUB-SYNTACTIC PROCESSING → SYNTACTIC PROCESSING → SEMANTIC PROCESSING → DISCOURSE PROCESSING → PRAGMATIC PROCESSING → result
• SUB-SYNTACTIC PROCESSING: RECOGNIZE SENTENCE BORDERS, TOKENIZATION, POS-TAGGING, LEMMAS, NP CHUNKING

Google Translate
• Example based translation

The syntactic layer
text → INITIAL PROCESSING → SUB-SYNTACTIC PROCESSING → SYNTACTIC PROCESSING → SEMANTIC PROCESSING → DISCOURSE PROCESSING → PRAGMATIC PROCESSING → result
• SYNTACTIC PROCESSING: CLAUSE SEGMENTATION, SYNTACTIC PARSING

Train a syntactic parser on a collection of syntactic trees (treebank)

Example of syntactic annotation

The semantic layer
text → INITIAL PROCESSING → SUB-SYNTACTIC PROCESSING → SYNTACTIC PROCESSING → SEMANTIC PROCESSING → DISCOURSE PROCESSING → PRAGMATIC PROCESSING → result
• SEMANTIC PROCESSING: WORD-SENSE DISAMBIGUATION, GENERATE ONTOLOGIES, SEMANTIC RELATIONS

Generate student exam tests from medical manuals
3rd year AI term project, October 2017 – January 2018

Representing medical knowledge in Protégé
• Example: "The adult cerebral blood flow is about 750-1000 ml/min, representing 15-20% of the heart blood flow"

QuoVadis – a corpus of book characters and their relations
• H. Sienkiewicz's Quo Vadis
• A corpus and a technology
  – The corpus: developed with master students in CL:
    • entities of type person and god
    • relations: coreference, kinship, affective, social
  – The technology:
    recognising entities and relations
I T.A.K.E Unconference, 7 June 2018

A corpus of semantic entities and relations
• Types of entities:
  – persons
  – gods
  – groups of persons and gods
  – body parts
• Semantic relations among entities of these types

Relations
• Anaphoric relations: co-referential
• Non-anaphoric relations:
  – kinship
  – affective
  – social

Anaphoric relations
• coref
• coref-interpret
• member-of, has-as-member (inverse)
• isa, class-of (inverse)
• part-of, has-as-part (inverse)
• subgroup-of, has-as-subgroup (inverse)
• has-name, name-of (inverse)
Example: [Lygia]1 was unable to answer, for weeping seized [her]2 anew. Acte gathered [the maiden]3 to her bosom, and strove to calm [her]4 excitement.
Relations: coref; coref-interpret; coref

Kinship relations
• parent-of
• child-of (inverse of parent-of)
• grandparent-of and grandchild-of (inverse)
• sibling (symmetrical)
• aunt-uncle-of, nephew-of (inverse relation)
• cousin-of (symmetrical)
• spouse-of (symmetrical)
• unknown
Example: "Pardon me, Lygia. For me thou art [ [of a king]2]1 and [ [of Plautius]4]3"
Relations: child-of; child-of

Social relations
• superior-of
• inferior-of
• in-cooperation-with
• colleague-of
• in-competition-with
• opposite-to
Example: [Petronius]1…but to [his]2 misfortune [he]3 [Cæsar himself]4, hence [he]5 roused [his]6 jealousy
Relations: in-competition-with; coref; coref; coref

Affective relations
• love
• loved-by
• hate
• hated-by
• upset
• friendship
• worship
Example: Vinicius entered Lygia's dungeon and remained there till daylight…Both changed by degrees into sad souls with [each]1 [other]2
Relation: rec-love

Relations
• Anaphoric: coref
Example: John met Maria on the ski slope. He raced her.
Labels on the slide: anaphor, antecedent

Arguments and triggers in relations
• Kinship: parent-of
Example: … her father …
Labels on the slide: source, destination, trigger

Arguments and triggers in relations
• Social: inferior-of
Example: Cæsar's principal courtiers …
Labels on the slide: trigger, destination, source

Arguments and triggers in relations
• Affective: worship
Example: Lygia
dropped on her knees to implore someone else
Labels on the slide: trigger, destination, source

Entities
Petronius… Vinicius was the son of his oldest sister, who years before had married his father, a man of consular dignity from the time of Tiberius

Successive slides mark up this same sentence with:
• Anaphoric relations: coref (three slides), class-of
• Kinship relations: sibling, child-of, parent-of, spouse-of
• Social relations: inferior-of

General statistics over the corpus
• 7,281 sentences
• 146,822 tokens, punctuation included
• 171,029 tokens summed up under all relations
• 24,636 entity mentions
• 22,301 referential relations
• 755 AKS relations
  (Affective + Kinship + Social)
• 752 triggers

[Figure] Example: affective relations love and worship
[Figure] Example: affective relations fear-of and hate
[Figure] Vinicius' links with other characters
[Figure] Semantic relations involving Vinicius

The discourse layer
text → INITIAL PROCESSING → SUB-SYNTACTIC PROCESSING → SYNTACTIC PROCESSING → SEMANTIC PROCESSING → DISCOURSE PROCESSING → PRAGMATIC PROCESSING → result
• DISCOURSE PROCESSING: ANAPHORA RESOLUTION, DISCOURSE STRUCTURE, TEMPORAL PROCESSING

Tash Aw: Map of the Invisible World
In books time does not flow linearly…

Time in texts often does not flow linearly
"In the kitchen, waiting for the water to boil, Margaret wondered how she had leapt so naturally to embrace Adam. She could not work out what had made her feel so happy to see him again."
Tash Aw: Map of the Invisible World

Segmentation
• [In the kitchen, waiting for the water to boil, Margaret wondered how]
  Margaret: the moment of the story, in the kitchen, wondering
• [she had leapt so naturally to embrace Adam]
  Margaret & Adam: in the past, somewhere, Margaret embraces Adam
• [She could not work out what had made her]
  Margaret: the moment of the story, in the kitchen, not understanding something
• [feel so happy to see him again]
  Margaret & Adam: in the past, somewhere, Margaret feels happy

Our perception of time in novels
[Figure] Tracks: Margaret; Margaret & Adam
Tash Aw: Map of the Invisible World – a bird's-eye view

The Time Yards Model

Roots…
• Temporal reasoning (and temporal logic: Allen, etc.)
• Much interest:
  – information extraction
  – question answering
  – textual entailment
  – deciphering the semantic content of texts
See: ACL workshop on Spatial and Temporal Reasoning (2001), LREC workshop on Annotation Standards for Temporal Information in Natural Language (2002)

Related work
• Annotation conventions/standards to cope with time as mentioned in language and text
• TimeML
  • TIMEX3 – explicit temporal expressions (times, dates, durations, etc.)
  • SIGNAL – function words that indicate how
    temporal objects are to be related to each other (e.g. on, during, when, if, etc.)
  • TLINK, SLINK, ALINK – temporal relationship holding between events, aspectual events, events and time expressions or events and signals
  • EVENT – event notation
• TARSQI Toolkit, able to recognise:
  • temporal expressions (TimeEx)
  • events
  • relations between them
• James Pustejovsky et al. (2003) TimeML: Robust Specification of Event and Temporal Expressions in Text, AAAI Technical Report SS-03-07
• Verhagen & Pustejovsky: Temporal Processing with the TARSQI Toolkit, Coling 2008

We are interested in connected sequences of events
• In NLP, on news
  – TimeLine: a representation of events which are chronologically ordered, mainly specific to an entity (a character or participant in an action, a geographical place or region)
    Chambers and Jurafsky (2008) Unsupervised Learning of Narrative Event Chains. In ACL
  – StoryLine: groups of interacting TimeLines, or mergers of two or more TimeLines where the same characters or entities are taking part in the action. The text structure obtained is not related to the flow of time
    Laparra, Aldabe and Rigau (2015) From TimeLines to StoryLines: A preliminary proposal for evaluating narratives. In ACL-IJCNLP

Let us imagine…
• … that a TYM technology records the TTs of the people who have a high impact on the political developments in a country
  – for example, by reading and interpreting the daily news
• then
  – the pages of history could be written automatically by an intelligent agent capable of synthesis
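
A minimal Python sketch of the TimeLine notion described above: events ordered chronologically and attached to the entities taking part in them. The `Event` class, the `build_timelines` function and the toy events are illustrative assumptions only; they are not part of TimeML, the TARSQI Toolkit or the cited papers, and a real system would first have to extract and normalise events and temporal expressions from text.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """A toy event: a date, a short label and the entities taking part."""
    when: date
    label: str
    entities: tuple


def build_timelines(events):
    """Group events by entity, each group sorted chronologically."""
    timelines = defaultdict(list)
    # Sort once by date, so every per-entity list ends up in time order.
    for event in sorted(events, key=lambda e: e.when):
        for entity in event.entities:
            timelines[entity].append(event)
    return dict(timelines)


# Invented toy data, given in text order rather than chronological order.
EVENTS = [
    Event(date(2018, 6, 7), "A gives a talk", ("A",)),
    Event(date(2017, 10, 1), "A and B start a project", ("A", "B")),
    Event(date(2018, 1, 31), "B finishes the project", ("B",)),
]

if __name__ == "__main__":
    for entity, timeline in build_timelines(EVENTS).items():
        print(entity, [e.label for e in timeline])
```

Under the same assumptions, a StoryLine could be approximated by merging the TimeLines of entities that share at least one event, here A and B through the project start.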